Large-scale avian vocalization detection delivers reliable global biodiversity insights

Tracking biodiversity and its dynamics at scale is essential if we are to solve global environmental challenges. Detecting animal vocalizations in passively recorded audio data offers an automatable, inexpensive, and taxonomically broad way to monitor biodiversity. However, the labor and expertise required to label new data and fine-tune algorithms for each deployment is a major barrier. In this study, we applied a pretrained bird vocalization detection model, BirdNET, to 152,376 h of audio comprising datasets from Norway, Taiwan, Costa Rica, and Brazil. We manually listened to a subset of detections for each species in each dataset, calibrated classification thresholds, and found precisions of over 90% for 109 of 136 species. While some species were reliably detected across multiple datasets, the performance of others was dataset specific. By filtering out unreliable detections, we could extract species and community-level insight into diel (Brazil) and seasonal (Taiwan) temporal scales, as well as landscape (Costa Rica) and national (Norway) spatial scales. Our findings demonstrate that, with relatively fast but essential local calibration, a single vocalization detection model can deliver multifaceted community and species-level insight across highly diverse datasets; unlocking the scale at which acoustic monitoring can deliver immediate applied impact.

Brazil: We surveyed 25 sites in the municipality of Mãe do Rio, Pará State, in February 2022. Sites were located in riparian forests and pastures and spaced at least 500 m apart. We recorded 777 hours of single-channel audio on a schedule of 1 minute every 5 minutes for 24 hours at a sample rate of 48 kHz, using 25 Audiomoth 2 devices mounted 1-2 m above ground level. Audio data was stored in uncompressed WAV format.

BirdNET model
We used a convolutional neural network (CNN) model, BirdNET, to detect bird vocalisations (ref. 3). For Norway, we used BirdNET-Lite (https://github.com/kahst/BirdNET-Lite), accessed in November 2022. For Taiwan, Costa Rica, and Brazil, we used the newer BirdNET-Analyzer model v2.3 (https://github.com/kahst/BirdNET-Analyzer) through the birdnetlib Python library. Re-analysing the Norway data with BirdNET-Analyzer would have been ideal, but we lacked the resources to relabel the randomly selected detections, and the conclusions of our study were unlikely to have changed. Location data was provided to BirdNET to filter for only species expected at each recorder (based on eBird observations).
To produce valid BirdNET inputs, we resampled audio to 48 kHz and generated non-overlapping three-second spectrograms (window size 8 ms, 25% overlap, 64 Mel-scaled frequency bands). In Norway, adjacent detections of the same species were joined together, but in the other datasets each three-second detection was kept as an independent detection.
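The segmentation step can be illustrated with a minimal Python sketch. It is not the BirdNET implementation: the function name `split_into_chunks` is hypothetical, resampling and the Mel filterbank are omitted, and dropping a trailing partial chunk is an assumption, since the handling of remainders is not stated here. The window and hop lengths follow directly from the stated 8 ms window and 25% overlap at 48 kHz.

```python
import numpy as np

SAMPLE_RATE = 48_000  # Hz, as stated in the text
CHUNK_SECONDS = 3     # length of each BirdNET input, as stated in the text

# Frame parameters implied by an 8 ms window with 25% overlap at 48 kHz.
WINDOW_LEN = SAMPLE_RATE * 8 // 1000  # 384 samples
HOP_LEN = WINDOW_LEN * 3 // 4         # 288 samples (75% of the window)


def split_into_chunks(signal, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split a 1-D signal into non-overlapping fixed-length chunks.

    A trailing remainder shorter than one chunk is dropped (an assumption).
    """
    signal = np.asarray(signal)
    chunk_len = sample_rate * chunk_seconds
    n_chunks = len(signal) // chunk_len
    return signal[: n_chunks * chunk_len].reshape(n_chunks, chunk_len)


# One minute of audio at 48 kHz gives 20 chunks of 144,000 samples each.
chunks = split_into_chunks(np.zeros(60 * SAMPLE_RATE))
print(chunks.shape)  # (20, 144000)
```

Under these assumptions, each one-minute recording in the Brazil schedule would contribute 20 candidate three-second segments.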

Model calibration and performance
For each species in each dataset, we randomly selected 50 detections with BirdNET classification confidences over 0.80. Species with fewer than 50 detections were excluded, leaving 136 species in total. Experts familiar with each study region listened to the randomly selected detections and labelled each BirdNET prediction as correct, incorrect, or unsure (e.g., if the call was ambiguous). Conservatively, we treated unsure labels as false positives.
We then measured precision, p, as p = Tp/(Tp+Fp), where Tp and Fp were true and false positives, respectively, using 20 BirdNET classification thresholds spaced evenly between 0.80 and 0.99, inclusive. For each species in each dataset, we determined optimal thresholds by choosing the lowest threshold that achieved 90% precision. For species where 90% precision was not reached using any threshold value, the lowest threshold that maximised precision was chosen.
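The threshold selection rule described above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the code used in the study: the function name `calibrate_threshold` is hypothetical, a detection is assumed to be kept when its confidence is greater than or equal to the threshold, and thresholds that retain no validated detections are assumed to be skipped.

```python
import numpy as np


def calibrate_threshold(confidences, is_correct, target=0.90):
    """Return (threshold, precision) for one species in one dataset.

    confidences: BirdNET confidence of each expert-validated detection.
    is_correct:  True where the expert label was "correct"; "incorrect"
                 and "unsure" labels are both passed as False.
    """
    confidences = np.asarray(confidences, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    # 20 evenly spaced thresholds: 0.80, 0.81, ..., 0.99
    thresholds = np.round(np.linspace(0.80, 0.99, 20), 2)

    results = []
    for t in thresholds:
        kept = confidences >= t  # assumption: threshold is inclusive
        if kept.any():           # assumption: skip thresholds with no detections
            precision = is_correct[kept].mean()  # Tp / (Tp + Fp)
            results.append((float(t), float(precision)))
    if not results:
        return None

    # Rule 1: the lowest threshold that reaches the target precision.
    for t, p in results:
        if p >= target:
            return t, p
    # Rule 2: otherwise, the lowest threshold that maximises precision.
    best = max(p for _, p in results)
    return next((t, p) for t, p in results if p == best)


# Toy example: the two lowest-confidence detections are false positives.
conf = [0.815, 0.835, 0.865, 0.905, 0.955, 0.975]
labels = [False, False, True, True, True, True]
print(calibrate_threshold(conf, labels))  # (0.84, 1.0)
```

In the toy example, precision is 4/6 at a threshold of 0.80 and only reaches the 90% target once both false positives are excluded, which first happens at 0.84.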
There was no correlation between calibrated classification thresholds and species precisions in the Taiwan, Norway, and Brazil datasets, but in Costa Rica there was a negative correlation (Pearson's R=-0.60, p=0.007).
We found that calibrating classification thresholds did not greatly change the numbers of detections in each dataset. For the 136 species with at least 50 detections in each dataset, without calibration (i.e., using 0.80 as a threshold for all species) there were 625,113 total detections across all datasets, whereas with calibrated thresholds there were 616,046 total detections; only 1.45% of detections were lost.
In downstream analyses (i.e., Figs. 2 and 3), we only considered detections with confidences above the optimal thresholds determined for each species in each dataset.

Species and community analyses
For the following analyses, we used calibrated classification thresholds and considered only species that reached over 90% precision and had at least 20 detections above their calibrated classification threshold in the dataset.
Brazil: We binned detections by hour of day to measure diel patterns in vocal activity. To normalise the data shown in Fig. 3a, for each species, we divided hourly numbers of detections by the maximum number of detections in a one-hour period. Species were ordered in the visualisation by their peak hour of vocal activity.
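A minimal sketch of this per-species normalisation and ordering is given below. The function names and the species labels in the example are hypothetical, and the input is assumed to be the integer hour of day (0-23) of each retained detection.

```python
import numpy as np


def diel_profile(hours_of_day):
    """Normalised hourly activity for one species.

    hours_of_day: integer hour (0-23) of each retained detection.
    Returns 24 values scaled so that the busiest hour equals 1.
    Assumes at least one detection (the text requires at least 20).
    """
    counts = np.bincount(np.asarray(hours_of_day, dtype=int), minlength=24)
    return counts / counts.max()


def order_by_peak_hour(profiles):
    """Sort species names by the hour at which their profile peaks."""
    return sorted(profiles, key=lambda name: int(np.argmax(profiles[name])))


# Hypothetical species: one most active at dawn, one at dusk.
profiles = {
    "dusk_species": diel_profile([19, 19, 19, 20]),
    "dawn_species": diel_profile([5, 5, 6, 18]),
}
print(order_by_peak_hour(profiles))  # ['dawn_species', 'dusk_species']
```

Because each species is divided by its own maximum, the profiles show when a species is most active rather than how often it was detected relative to other species.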
Costa Rica: For each day at each site, we counted the number of vocalisations detected for the Yellow-throated Toucan and normalised for sampling effort by dividing by the number of minutes recorded on that day. This allowed us to produce a measure of species vocalisation rate which accounted for imbalances in sampling effort across the sites. Then, to produce Fig. 3b, we grouped vocalisation rates for each day at each site by habitat type using land use maps provided by NASA, created at a scale of 5 × 5 m using Landsat 5 Thematic Mapper (TM) and Landsat 8 Operational Land Imager (OLI).
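The effort normalisation can be sketched as follows. The data layout, function names, and site and habitat labels are assumptions made for illustration; in particular, assigning a rate of zero to site-days with recordings but no detections is an assumed choice.

```python
from collections import defaultdict


def vocalisation_rates(detections, minutes_recorded):
    """Detections per recorded minute for each (site, day).

    detections:       iterable of (site, day) pairs, one per detection.
    minutes_recorded: {(site, day): minutes of audio recorded}.
    Site-days with audio but no detections get a rate of 0 (an assumption).
    """
    counts = defaultdict(int)
    for site, day in detections:
        counts[(site, day)] += 1
    return {
        key: counts[key] / minutes
        for key, minutes in minutes_recorded.items()
        if minutes > 0
    }


def group_by_habitat(rates, habitat_of_site):
    """Collect site-day rates into one list per habitat type."""
    grouped = defaultdict(list)
    for (site, _day), rate in rates.items():
        grouped[habitat_of_site[site]].append(rate)
    return dict(grouped)


# Hypothetical example: site s1 recorded twice as long as s2 on day d1.
detections = [("s1", "d1")] * 6 + [("s2", "d1")] * 3
minutes = {("s1", "d1"): 120, ("s2", "d1"): 60, ("s2", "d2"): 60}
rates = vocalisation_rates(detections, minutes)
print(group_by_habitat(rates, {"s1": "forest", "s2": "pasture"}))
```

In this example the two sites have different raw counts on day d1 (6 and 3) but the same rate (0.05 detections per minute), which is the imbalance the normalisation is meant to remove.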
Norway: We focused on detections of the Willow Warbler between 30/04/2022 and 15/06/2022, and sites were aggregated by region. To show temporal variations in relative vocal activity across regions in Fig. 3c, detections were binned by day, and we divided daily numbers of detections by the maximum number of detections in a day for each region.
Taiwan: We combined detections across all sites and derived daily occurrence (presence/absence) data for each species. For Fig. 3d, we smoothed the daily occurrence data by convolving the binary occurrence time series for each species with a vector of ones of shape [1x7] (i.e., at the weekly resolution). Species were ordered by clustering the daily occurrence data to group species with similar temporal dynamics across the monitoring period (e.g., breeding migratory, wintering migratory, resident, etc.).
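The weekly smoothing can be illustrated with NumPy. The function name `weekly_smooth` is hypothetical, and the use of `mode="same"` (which keeps the output the same length as the input and centres the window on each day) is an assumption, since edge handling is not specified here. The clustering step used to order species is not shown.

```python
import numpy as np


def weekly_smooth(daily_presence, window=7):
    """Convolve a binary daily presence series with a vector of ones.

    Each output value is the number of days with a detection in the
    window centred on that day (fewer days are covered at the edges).
    """
    presence = np.asarray(daily_presence, dtype=int)
    kernel = np.ones(window, dtype=int)
    return np.convolve(presence, kernel, mode="same")


# A single day of presence is spread over the surrounding week.
print(weekly_smooth([0, 0, 0, 1, 0, 0, 0, 0, 0]).tolist())
# [1, 1, 1, 1, 1, 1, 1, 0, 0]
```

With this kernel the smoothed values range from 0 to 7, so a species present every day in a week reaches the maximum value of 7.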